Data Analysis - Velib Project in Python


Authors: Amine Aziz Alaoui (IRT St-Exupéry), J. Chevallier (INSA Toulouse), J. Guérin (ANITI), Franck Kouassi (INSA Toulouse), O. Roustant (INSA Toulouse).

We consider the velib data set, related to the bike sharing system of Paris. The data are loading profiles of the bike stations over one week, collected every hour over the period Monday 2nd Sept. - Sunday 7th Sept., 2014. The loading profile of a station, or simply loading, is defined as the ratio of the number of available bikes to the number of bike docks. A loading of 1 means that the station is fully loaded, i.e. all bikes are available. A loading of 0 means that the station is empty, i.e. all bikes have been rented.

From the viewpoint of data analysis, the individuals are the stations. The variables are the 168 time steps (hours in the week). The aim is to detect clusters in the data, corresponding to common customer usages. This clustering should then be used to predict the loading profile.


The aim of this tutorial is to provide you with a starting point for your project. Unsurprisingly, the first step is to get to grips with the dataset by exploring it through simple routines:

  • How are the data coded?
  • How many stations are observed?
  • What is the dispersion of the data?
  • etc.

You will find some suggested solutions in the "solutions" folder (they can certainly be improved). I can only urge you to first try to answer the questions yourself, making sure you know which graph answers the question, and then to look in the Python documentation to find out how to make that graph (there are lots of resources on the Internet for Python!). The R counterpart to this tutorial is also available on wikistat.


Preliminary: Load Data and Quality Assessment

Todo: Load the data
  • velibLoading.csv file.
  • velibCoord.csv file
  • Check that loading has gone smoothly by looking at the first lines of each data set.
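As a rough illustration of the loading step, here is a minimal sketch using pandas. The helper name `load_velib`, the comma separator and the use of the first column as the station index are assumptions, not taken from the solutions; the real files may need a different `sep`. Two tiny in-memory tables, built from rows shown in this notebook, stand in for velibLoading.csv and velibCoord.csv so that the example runs without the files.

```python
import io

import pandas as pd


def load_velib(loading_source, coord_source, sep=","):
    """Read both tables, using the first column as the station index.

    `sep` is an assumption: adjust it to the delimiter of the real files.
    """
    loading = pd.read_csv(loading_source, sep=sep, index_col=0)
    coord = pd.read_csv(coord_source, sep=sep, index_col=0)
    return loading, coord


# In-memory stand-ins for the two CSV files (truncated to two stations
# and two time steps, for illustration only).
loading_csv = io.StringIO(
    "id,Lun-00,Lun-01\n"
    "1,0.038462,0.038462\n"
    "2,0.478261,0.478261\n"
)
coord_csv = io.StringIO(
    "id,longitude,latitude,bonus,names\n"
    "1,2.377389,48.886300,0,EURYALE DEHAYNIN\n"
    "2,2.317591,48.890020,0,LEMERCIER\n"
)

loading, coord = load_velib(loading_csv, coord_csv)

# Quick sanity checks: first rows and table dimensions.
print(loading.head())
print(coord.head())
print(loading.shape, coord.shape)
```

With the real data, the same function could be called with the two file paths instead of the `io.StringIO` objects.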
Out[49]:
Lun-00 Lun-01 Lun-02 Lun-03 Lun-04 Lun-05 Lun-06 Lun-07 Lun-08 Lun-09 ... Dim-14 Dim-15 Dim-16 Dim-17 Dim-18 Dim-19 Dim-20 Dim-21 Dim-22 Dim-23
1 0.038462 0.038462 0.076923 0.038462 0.038462 0.038462 0.038462 0.038462 0.107143 0.000000 ... 0.296296 0.111111 0.111111 0.148148 0.307692 0.076923 0.115385 0.076923 0.153846 0.153846
2 0.478261 0.478261 0.478261 0.434783 0.434783 0.434783 0.434783 0.434783 0.260870 0.043478 ... 0.043478 0.000000 0.217391 0.130435 0.045455 0.173913 0.173913 0.173913 0.260870 0.391304
3 0.218182 0.145455 0.127273 0.109091 0.109091 0.109091 0.090909 0.090909 0.054545 0.109091 ... 0.259259 0.259259 0.203704 0.129630 0.148148 0.296296 0.314815 0.370370 0.370370 0.407407
4 0.952381 0.952381 0.952381 0.952381 0.952381 0.952381 0.952381 1.000000 1.000000 1.000000 ... 1.000000 1.000000 0.904762 0.857143 0.857143 0.857143 0.761905 0.761905 0.761905 0.761905
5 0.927536 0.811594 0.739130 0.724638 0.724638 0.724638 0.724638 0.724638 0.753623 0.971014 ... 0.227273 0.454545 0.590909 0.833333 1.000000 0.818182 0.636364 0.712121 0.621212 0.575758
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1185 0.000000 0.000000 0.000000 0.000000 0.000000 0.045455 0.000000 0.090909 0.136364 0.000000 ... 0.043478 0.173913 0.043478 0.086957 0.086957 0.304348 0.304348 0.130435 0.086957 0.086957
1186 0.200000 0.133333 0.155556 0.177778 0.177778 0.177778 0.200000 0.177778 0.288889 0.511111 ... 0.266667 0.288889 0.155556 0.222222 0.333333 0.311111 0.355556 0.377778 0.333333 0.355556
1187 0.551724 0.517241 0.551724 0.517241 0.517241 0.551724 0.551724 0.448276 0.241379 0.034483 ... 0.482759 0.310345 0.000000 0.000000 0.103448 0.379310 0.310345 0.310345 0.344828 0.482759
1188 0.476190 0.428571 0.428571 0.428571 0.428571 0.428571 0.476190 0.523810 0.428571 0.476190 ... 0.880000 0.760000 0.750000 0.958333 1.000000 0.791667 0.791667 0.500000 0.434783 0.478261
1189 0.937500 0.968750 0.906250 0.875000 0.906250 0.906250 0.937500 0.937500 0.968750 1.000000 ... 1.000000 1.000000 0.687500 0.550000 0.950000 0.444444 0.526316 0.894737 0.947368 0.833333

1189 rows × 168 columns

Out[13]:
longitude latitude bonus names
1 2.377389 48.886300 0 EURYALE DEHAYNIN
2 2.317591 48.890020 0 LEMERCIER
3 2.330447 48.850297 0 MEZIERES RENNES
4 2.271396 48.833734 0 FARMAN
5 2.366897 48.845887 0 QUAI DE LA RAPEE
Question: Do these data sets contain missing data?
longitude    0
latitude     0
bonus        0
names        0
dtype: int64
0
--- Loading ---
0

--- Coord ---
longitude    0
latitude     0
bonus        0
names        0
dtype: int64
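Counts like the ones above can be obtained with `isna()`. The following is a minimal sketch on made-up values (one missing cell is planted on purpose); the helper name `count_missing` is hypothetical.

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the total number of missing cells in a DataFrame."""
    return int(df.isna().sum().sum())


# Toy tables for illustration only; one NaN is inserted deliberately.
loading = pd.DataFrame(
    {"Lun-00": [0.1, np.nan], "Lun-01": [0.2, 0.3]}, index=[1, 2]
)
coord = pd.DataFrame(
    {"longitude": [2.37, 2.31], "latitude": [48.88, 48.89]}, index=[1, 2]
)

print("--- Loading ---")
print(count_missing(loading))   # total over all cells
print("--- Coord ---")
print(coord.isna().sum())       # per-column counts
```

A per-column count is readable for the four coordinate columns, while a single total is more practical for the 168 loading columns.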
Question: Do these data sets contain duplicate data?
--- Loading ---
0

--- Coord ---
0
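One possible way to count duplicated rows is `DataFrame.duplicated()`, which flags every row identical to an earlier one. The sketch below uses made-up rows with one planted duplicate; `count_duplicate_rows` is a hypothetical helper name.

```python
import pandas as pd


def count_duplicate_rows(df):
    """Count rows that are exact copies of an earlier row."""
    return int(df.duplicated().sum())


# Toy table for illustration only: the third row repeats the first one.
coord = pd.DataFrame(
    {
        "longitude": [2.40, 2.41, 2.40],
        "latitude": [48.87, 48.88, 48.87],
        "names": ["A", "B", "A"],
    },
    index=[1, 2, 3],
)

print("--- Coord ---")
print(count_duplicate_rows(coord))
```

Note that `duplicated()` compares whole rows: two stations sharing a name but not their coordinates are not counted as duplicates.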
Question: Are any stations present more than once in the data set?
  • You can use the value_counts() function to count the number of occurrences of each station name.
  • Discuss this result in the light of the previous question. If the answer is yes, we could, for example, try to visualize the different entries for the same station.
1                   EURYALE DEHAYNIN
2                          LEMERCIER
3                    MEZIERES RENNES
4                             FARMAN
5                   QUAI DE LA RAPEE
                    ...             
1185            CHAPELLE MARX DORMOY
1186                           DUROC
1187     GEORGES MESSIER (MONTROUGE)
1188              VORGES (VINCENNES)
1189                   QUAI VOLTAIRE
Name: names, Length: 1189, dtype: object
Out[30]:
 PORTE DES LILAS                 3
 GARE D'AUSTERLITZ               3
 PORTE DE BAGNOLET               2
 CHERCHE MIDI                    2
 LEGENDRE                        2
                                ..
 BEL AIR                         1
 ASSAS LUXEMBOURG                1
 COURS DE VINCENNES BD DAVOUT    1
 RUISSEAU ORDENER                1
 QUAI VOLTAIRE                   1
Name: names, Length: 1161, dtype: int64
 PORTE DES LILAS           3
 GARE D'AUSTERLITZ         3
 GARE DE L'EST             2
 AQUEDUC                   2
 DODU                      2
                          ..
 CHARONNE                  1
 BOUSSINGAULT - TOLBIAC    1
 RIVOLI MAIRIE DU 1ER      1
 JOURDAN BARBOUX           1
 QUAI VOLTAIRE             1
Name: names, Length: 1161, dtype: int64

Out[32]:
longitude latitude bonus names
362 2.404770 48.876604 1 PORTE DES LILAS
450 2.405960 48.875412 1 PORTE DES LILAS
957 2.411046 48.878099 1 PORTE DES LILAS
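The name counts and the extraction of all entries sharing a name can be sketched as follows. The table below only contains four rows copied from the outputs of this notebook, so the counts it prints are those of this small subset, not of the full data set.

```python
import pandas as pd

# Four rows taken from the outputs shown in this notebook.
coord = pd.DataFrame(
    {
        "longitude": [2.404770, 2.405960, 2.411046, 2.271396],
        "latitude": [48.876604, 48.875412, 48.878099, 48.833734],
        "bonus": [1, 1, 1, 0],
        "names": ["PORTE DES LILAS"] * 3 + ["FARMAN"],
    },
    index=[362, 450, 957, 4],
)

# Number of occurrences of each station name, most frequent first.
name_counts = coord["names"].value_counts()

# Names that appear more than once.
repeated = name_counts[name_counts > 1]
print(repeated)

# All entries that share the most frequent name.
most_common = name_counts.index[0]
print(coord[coord["names"] == most_common])
```

Here the three entries have the same name but different coordinates, which is consistent with a duplicate-row count of 0.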

First Insights into the Dataset

Todo: Plot the loading of a station
  • Plot the load evolution of the i-th station over time;
  • Draw a vertical line to delimit the days (Hint: How many days do we observe?);
  • Enter the station name in the figure title;
  • Label the axes in the figure.
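The four steps above can be sketched with matplotlib as follows. The data are random stand-ins, the function name `plot_station` is hypothetical, and the column names other than the "Lun-" and "Dim-" ones visible in the table are assumed to follow the usual French day abbreviations.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the loading table (assumption: 168 hourly columns
# named "Lun-00" ... "Dim-23").
days = ["Lun", "Mar", "Mer", "Jeu", "Ven", "Sam", "Dim"]
columns = [f"{d}-{h:02d}" for d in days for h in range(24)]
rng = np.random.default_rng(0)
loading = pd.DataFrame(
    rng.uniform(0, 1, (3, 168)), index=[1, 2, 3], columns=columns
)
names = pd.Series(
    ["EURYALE DEHAYNIN", "LEMERCIER", "MEZIERES RENNES"], index=[1, 2, 3]
)


def plot_station(loading, names, station_id, ax=None, hours_per_day=24):
    """Plot the loading of one station with day separators."""
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 3))
    profile = loading.loc[station_id].to_numpy()
    ax.plot(np.arange(len(profile)), profile, linewidth=1)
    # One vertical line at each day boundary (6 lines for 7 days).
    for x in range(hours_per_day, len(profile), hours_per_day):
        ax.axvline(x, color="gray", linestyle="--", linewidth=0.8)
    ax.set_title(str(names.loc[station_id]))
    ax.set_xlabel("Time (hours since the first time step)")
    ax.set_ylabel("Loading")
    return ax


ax = plot_station(loading, names, station_id=2)
```

In a notebook the figure is displayed automatically; in a script, `plt.show()` would be needed.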

A loading of 0 means that all the bikes have been taken.

Question: Does loading differ from one station to another?

Draw a 4 x 4 grid of plots for 16 stations of your choice. Do not forget the vertical lines that delimit the days.
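A possible layout for this grid, sketched on random stand-in data (the first 16 rows are used here, but any 16 station identifiers would do; the column naming is an assumption as only "Lun-" and "Dim-" columns are visible in the table):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the loading table: 20 stations, 168 hourly columns.
days = ["Lun", "Mar", "Mer", "Jeu", "Ven", "Sam", "Dim"]
columns = [f"{d}-{h:02d}" for d in days for h in range(24)]
rng = np.random.default_rng(1)
loading = pd.DataFrame(
    rng.uniform(0, 1, (20, 168)), index=range(1, 21), columns=columns
)

hours = np.arange(loading.shape[1])
fig, axes = plt.subplots(4, 4, figsize=(16, 10), sharex=True, sharey=True)

for ax, station_id in zip(axes.ravel(), loading.index[:16]):
    ax.plot(hours, loading.loc[station_id].to_numpy(), linewidth=0.8)
    # Vertical lines at the day boundaries.
    for x in range(24, len(hours), 24):
        ax.axvline(x, color="gray", linestyle="--", linewidth=0.6)
    ax.set_title(f"Station {station_id}", fontsize=8)

# Axis labels only on the outer panels, since the axes are shared.
for ax in axes[-1, :]:
    ax.set_xlabel("Time (hours)")
for ax in axes[:, 0]:
    ax.set_ylabel("Loading")
fig.tight_layout()
```

Sharing the axes keeps the 16 panels directly comparable.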


Comments?

Todo: Draw the boxplot of the variables, sorted in time order.
  1. What can you say about the distribution of the variables?
  2. Position, dispersion, symmetry?
  3. Can you see a difference between days?

Hint: To change the graphical properties of boxplots (for example, the thickness of the median), use the patch_artist = True argument in the plt.boxplot function.
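The boxplot of the 168 variables could be drawn along these lines. This is a sketch on random stand-in data; the colors and line widths are arbitrary choices, and the day abbreviations other than "Lun" and "Dim" are assumed.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the loading table: 100 stations, 168 hourly columns.
days = ["Lun", "Mar", "Mer", "Jeu", "Ven", "Sam", "Dim"]
columns = [f"{d}-{h:02d}" for d in days for h in range(24)]
rng = np.random.default_rng(2)
loading = pd.DataFrame(rng.uniform(0, 1, (100, 168)), columns=columns)

fig, ax = plt.subplots(figsize=(16, 4))

# One box per column; the columns are already in time order.
# patch_artist=True turns the boxes into filled patches.
bp = ax.boxplot(loading.to_numpy(), patch_artist=True, showfliers=False)

for box in bp["boxes"]:
    box.set_facecolor("lightblue")
for median in bp["medians"]:
    median.set_linewidth(2)
    median.set_color("darkred")

# Boxes sit at positions 1..168: label only the first hour of each day.
ax.set_xticks(np.arange(1, 169, 24))
ax.set_xticklabels(days)
ax.set_xlabel("Time (day of the week)")
ax.set_ylabel("Loading")
```

The dictionary returned by `boxplot` gives access to the individual artists ("boxes", "medians", "whiskers", ...), which is how their appearance is changed here.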


Comments? We can see the loading profile over the course of the week (day by day).

Average Loading

Question: What is the average station fill rate?

Which station is, on average, the fullest? the least full?

--- Average fill rate ---
0.3816217759807477

--- Least crowded station, on average ---
Average fill rate : 0.016132842025699153
longitude              2.427934
latitude              48.873929
bonus                         1
names         HORNET (BAGNOLET)
Name: 997, dtype: object

--- Fullest station, on average ---
Average fill rate : 0.9193722943722953
longitude                          2.398262
latitude                           48.81466
bonus                                     0
names         INSURRECTION AOUT 1944 (IVRY)
Name: 1107, dtype: object
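Figures of this kind can be computed with a row-wise mean followed by `idxmin()` and `idxmax()`. The sketch below uses three made-up stations with arbitrary constant loadings, for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: three stations with constant loadings 0.1, 0.9 and 0.5.
loading = pd.DataFrame(
    np.array([[0.1] * 4, [0.9] * 4, [0.5] * 4]),
    index=[1, 2, 3],
    columns=["Lun-00", "Lun-01", "Lun-02", "Lun-03"],
)
coord = pd.DataFrame(
    {"names": ["STATION A", "STATION B", "STATION C"]}, index=[1, 2, 3]
)

# Average loading of each station over all time steps.
station_mean = loading.mean(axis=1)

# Overall average (equal to the mean of all cells when nothing is missing).
overall_mean = station_mean.mean()

# Index labels of the least full and the fullest stations.
emptiest_id = station_mean.idxmin()
fullest_id = station_mean.idxmax()

print("--- Average fill rate ---")
print(overall_mean)
print("--- Least full station, on average ---")
print(station_mean.loc[emptiest_id], coord.loc[emptiest_id, "names"])
print("--- Fullest station, on average ---")
print(station_mean.loc[fullest_id], coord.loc[fullest_id, "names"])
```

Because both tables share the same station index, the label returned by `idxmin()` or `idxmax()` can be used directly to look up the station in the coordinates table.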
Question: Does the average load vary from one station to another?
  • Show the evolution of the average load for each station.
  • On the same graph, plot the average loading for the entire data set.
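One reading of this exercise is to plot the weekly average of each station against its index and to overlay the overall average as a horizontal line. A minimal sketch under that assumption, on random stand-in data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the loading table: 200 stations, 168 time steps.
rng = np.random.default_rng(3)
loading = pd.DataFrame(rng.uniform(0, 1, (200, 168)), index=range(1, 201))

station_mean = loading.mean(axis=1)   # one value per station
overall_mean = station_mean.mean()    # one value for the whole data set

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(
    station_mean.index,
    station_mean.to_numpy(),
    ".",
    markersize=3,
    label="Station average",
)
# Horizontal line at the overall average, spanning all stations.
ax.hlines(
    overall_mean,
    xmin=station_mean.index.min(),
    xmax=station_mean.index.max(),
    colors="red",
    label="Overall average",
)
ax.set_xlabel("Station index")
ax.set_ylabel("Average loading")
ax.legend()
```

Sorting `station_mean` before plotting is another option if the goal is to see the spread of the averages rather than their order in the file.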

Comments?

Question: Does the average load vary over the course of a day?

Plot the average hourly loading for each day (on a single graph).
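Since the 168 columns are ordered day by day, the hourly averages can be reshaped into a 7 x 24 array, one row per day. A sketch on random stand-in data, assuming the columns run from "Lun-00" to "Dim-23" with the usual French day abbreviations in between:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the loading table.
days = ["Lun", "Mar", "Mer", "Jeu", "Ven", "Sam", "Dim"]
columns = [f"{d}-{h:02d}" for d in days for h in range(24)]
rng = np.random.default_rng(4)
loading = pd.DataFrame(rng.uniform(0, 1, (50, 168)), columns=columns)

# Average over stations for each of the 168 time steps.
hourly_mean = loading.mean(axis=0).to_numpy()

# Row i holds the 24 hourly averages of day i.
by_day = hourly_mean.reshape(7, 24)

fig, ax = plt.subplots(figsize=(8, 4))
for day, profile in zip(days, by_day):
    ax.plot(np.arange(24), profile, label=day)
ax.set_xlabel("Hour of the day")
ax.set_ylabel("Average loading")
ax.legend(title="Day")
```

The reshape relies on the column order; if the columns were shuffled, they would need to be sorted first.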


Comments?

Velib Station Map

Question: Where are the velib stations located?
  • Plot the station coordinates on a 2D map (latitude vs. longitude).
  • Use the average hourly loading as a color scale.
  • You can consider different times of day, for example 6am, 12pm, 11pm on Monday, or the average weekly load at 6am.
  • You can consider different days at the same time, or the average load for each day.
  • You can use the scatter_mapbox function of plotly.express to load the map of Paris.
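The first two bullets can be sketched with a plain matplotlib scatter plot, without a background map (the plotly route suggested in the last bullet is not shown here). Coordinates and loadings are random stand-ins, and the column naming "Lun-00" ... "Dim-23" is an assumption.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-ins: 50 stations scattered over a Paris-like bounding box.
days = ["Lun", "Mar", "Mer", "Jeu", "Ven", "Sam", "Dim"]
columns = [f"{d}-{h:02d}" for d in days for h in range(24)]
rng = np.random.default_rng(5)
n = 50
coord = pd.DataFrame(
    {
        "longitude": rng.uniform(2.25, 2.42, n),
        "latitude": rng.uniform(48.82, 48.90, n),
    },
    index=range(1, n + 1),
)
loading = pd.DataFrame(
    rng.uniform(0, 1, (n, 168)), index=coord.index, columns=columns
)

# Weekly average at a given hour: the 7 columns ending with that hour.
hour = 18
hour_cols = [c for c in loading.columns if c.endswith(f"-{hour:02d}")]
weekly_avg = loading[hour_cols].mean(axis=1)

fig, ax = plt.subplots(figsize=(7, 6))
sc = ax.scatter(
    coord["longitude"],
    coord["latitude"],
    c=weekly_avg.loc[coord.index],  # align colors with the coordinates
    cmap="viridis",
    vmin=0,
    vmax=1,
    s=15,
)
fig.colorbar(sc, ax=ax, label="Loading")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title(f"Stations loading - Weekly average at {hour} h")
```

Fixing `vmin` and `vmax` keeps the color scale identical across maps drawn for different hours or days.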

Comments?


Comments?

[Figure: Stations loading - Weekly average at 18 h. Color scale from 0 to 0.8. Base map © Carto © OpenStreetMap contributors]

Comments?

Influence of Altitude Difference on Station Loading

Question: Does Paris have many hilltop stations?
  • Compare the number of hilltop stations with the others.
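A possible starting point, assuming that the `bonus` column of the coordinates table flags hilltop stations with the value 1 (this meaning is not stated in this notebook and should be checked against the data description). The rows below are a small subset copied from the outputs above, so the counts are not those of the full data set.

```python
import pandas as pd

# Subset of the coordinates table, copied from the outputs of this notebook.
coord = pd.DataFrame(
    {
        "bonus": [0, 0, 0, 0, 0, 1, 1, 1],
        "names": [
            "EURYALE DEHAYNIN",
            "LEMERCIER",
            "MEZIERES RENNES",
            "FARMAN",
            "QUAI DE LA RAPEE",
            "PORTE DES LILAS",
            "PORTE DES LILAS",
            "PORTE DES LILAS",
        ],
    },
    index=[1, 2, 3, 4, 5, 362, 450, 957],
)

# Assumption: bonus == 1 marks a station located on a hill.
is_hill = coord["bonus"] == 1

n_hill = int(is_hill.sum())    # number of hilltop stations
n_other = int((~is_hill).sum())  # number of other stations
share = is_hill.mean()         # proportion of hilltop stations

print(f"Hilltop: {n_hill}, others: {n_other}, share: {share:.1%}")
```

A bar chart of `coord["bonus"].value_counts()` would show the same comparison graphically.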
Question: Are hilltop stations more crowded than others?
  • Plot the station coordinates on a 2D map (latitude vs. longitude), using a different color for stations which are located on a hill.
  • Redo the initial study, but distinguish hilltop stations from others.
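To redo the study by group, one option is to average the loading profiles separately for hilltop and other stations. The sketch below again assumes that `bonus == 1` marks a hilltop station, and uses arbitrary constant loadings (0.2 and 0.6) that are made up for illustration and say nothing about the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

days = ["Lun", "Mar", "Mer", "Jeu", "Ven", "Sam", "Dim"]
columns = [f"{d}-{h:02d}" for d in days for h in range(24)]

# Made-up data: two "other" stations at 0.6 and two hilltop ones at 0.2.
coord = pd.DataFrame({"bonus": [0, 0, 1, 1]}, index=[1, 2, 3, 4])
loading = pd.DataFrame(
    np.vstack([np.full((2, 168), 0.6), np.full((2, 168), 0.2)]),
    index=coord.index,
    columns=columns,
)

# Assumption: bonus == 1 marks a station located on a hill.
group = (coord["bonus"] == 1).map({True: "hilltop", False: "other"})

# Average hourly profile of each group (rows: groups, columns: hours).
profiles = loading.groupby(group).mean()

fig, ax = plt.subplots(figsize=(10, 3))
for label, profile in profiles.iterrows():
    ax.plot(np.arange(profile.size), profile.to_numpy(), label=label)
# Vertical lines at the day boundaries.
for x in range(24, 168, 24):
    ax.axvline(x, color="gray", linestyle="--", linewidth=0.6)
ax.set_xlabel("Time (hours)")
ax.set_ylabel("Average loading")
ax.legend()
```

The same `group` labels can be passed as colors to a scatter plot of the coordinates to draw the map requested in the first bullet.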